This practice compares Polars with Pandas and Matplotlib with Plotly, as an exercise in exploring different tools for data manipulation and visualization. While Pandas and Matplotlib are staples of data analysis, Polars and Plotly are newer, powerful alternatives, each offering capabilities suited to modern data science challenges. This practice therefore focuses on EDA rather than ML methods.

Import Libraries

Load Data

When comparing the output of Polars and Pandas, two differences stand out: Polars displays each column's dtype within the table itself and omits the row index. Pandas, in contrast, shows a row index and typically prints the row and column counts (e.g., "5 rows × 31 columns") at the footer of the output.

Data Preprocessing

After cleaning, every row remains, which indicates that the dataset contains no missing values. Notably, the Polars describe function also includes string columns in its output, whereas the Pandas describe function summarizes only numeric columns by default.

EDA for univariate analysis

All visualizations in the following analysis use the Polars DataFrame.

Import Libraries for analysis

Age Distribution

Even in univariate analysis, Plotly is clearly easier to read than Matplotlib. With Plotly, hovering the mouse over a bar reveals its precise numerical values, with no separate aggregation step required.

Income Distribution

For this dataset, the distribution of income levels is fairly balanced.

Gender Distribution

Seaborn is built primarily around pandas and similar data structures. Polars, on the other hand, is a fast data manipulation library designed for large datasets, with data structures that differ somewhat from pandas. As a result, depending on the installed version, Seaborn may not accept a Polars DataFrame directly as input, and the data may need to be converted first.

Excluding the other category, this dataset shows a higher proportion of males than females.

Distribution of Location

This dataset primarily focuses on individuals from India, the US, and several other countries. Therefore, the insights derived from this data are particularly relevant to populations in these regions.

Distribution of Platforms

The dataset collects data from four different platforms: TikTok, Instagram, YouTube, and Facebook. The distribution of data among these platforms is fairly balanced.

I will use Plotly exclusively to explore the remaining variables, such as Profession, Frequency, Productivity Loss, Current Activity, and OS.

Debt Ratio

Owns Property Ratio